Abstract. We prove that Nakhleh's latest 'metric' for phylogenetic networks separates
distinguishable phylogenetic networks, and that a slight modification of it
provides a true distance on the class of all phylogenetic networks.



1 Introduction 

L. Nakhleh has recently proposed a dissimilarity measure for the comparison of phylogenetic
networks [4], but he has only proved that it satisfies the separation axiom for metrics
(zero distance means isomorphism) on the class of all reduced phylogenetic networks in
the sense of [3]. Although we show that this measure separates phylogenetic networks
more general than the reduced ones (for instance, the tree-child phylogenetic networks [1]),
it does not satisfy the separation axiom on the whole class of all phylogenetic networks
(see Remark 1 below).

In this note we complement Nakhleh's work in two directions. On the one hand, we
prove that, for this dissimilarity measure, zero distance implies indistinguishability up to
reduction in the sense of [3], a goal that had already been pursued, unsuccessfully [2], by
Moret, Nakhleh, Warnow et al. in loc. cit. In this way, and to the best of our
knowledge, Nakhleh's dissimilarity measure turns out to be the first one that separates
distinguishable networks. On the other hand, we show that a slight modification
of Nakhleh's definition does yield a true distance on the whole class of all phylogenetic
networks. Again to the best of our knowledge, this is the first true metric defined on this
class.



1.1 Notations 

Let $N = (V, E)$ be a DAG (a finite directed acyclic graph). We say that a node $v \in V$ is a
child of $u \in V$ if $(u, v) \in E$; we also say then that $u$ is a parent of $v$. We say that a node is
a tree node when it has at most one parent, and that it is a hybrid node when it has more
than one parent. A leaf is a node without children, and a node that is not a leaf is called
internal. A DAG is rooted when it has only one root: a node without parents.

A path in $N$ is a sequence of nodes $(v_0, v_1, \ldots, v_k)$ such that $(v_{i-1}, v_i) \in E$ for all
$i = 1, \ldots, k$. We call $v_0$ the origin of the path, $v_1, \ldots, v_{k-1}$ its intermediate nodes, $v_k$ its
end, and $k$ its length. We denote by $u \rightsquigarrow v$ any path with origin $u$ and end $v$ and, whenever
there exists a path $u \rightsquigarrow v$, we say that $v$ is a descendant of $u$.

The height $h(v)$ of a node $v$ in a DAG $N$ is the largest length of a path from $v$ to a
leaf. The absence of cycles implies that the nodes of a DAG can be stratified by means of
their heights: the nodes of height 0 are the leaves, the nodes of height 1 are those nodes
all whose children are leaves, the nodes of height 2 are those nodes all whose children are
leaves and nodes of height 1, and so on. If a node has height $m$, then all its children have
height smaller than $m$, and at least one of them has height exactly $m - 1$.
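
The stratification by heights can be sketched in a few lines of Python. This is only an illustration under our own assumptions: a DAG is encoded as a dictionary mapping each node to the list of its children (leaves map to the empty list), and the network `toy` is a hypothetical example, not one taken from the references.

```python
def node_heights(children):
    """Return {node: height}: the largest length of a path from the node to a leaf.

    `children` maps every node to the list of its children (assumed encoding).
    """
    heights = {}

    def height(v):
        if v not in heights:
            kids = children[v]
            # Leaves have height 0; otherwise one more than the tallest child.
            heights[v] = 0 if not kids else 1 + max(height(c) for c in kids)
        return heights[v]

    for v in children:
        height(v)
    return heights


# Hypothetical network: root r, hybrid node h with parents a and b.
toy = {"r": ["a", "b"], "a": ["1", "h"], "b": ["h", "3"], "h": ["2"],
       "1": [], "2": [], "3": []}
heights = node_heights(toy)
```

In this toy network every internal node has at least one child whose height is exactly one less than its own, as stated above.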

Given a finite set $S$, an $S$-DAG is a DAG whose leaves are bijectively labeled by
elements of $S$. We shall always identify, usually without any further notice, each leaf of
an $S$-DAG with its label. Two $S$-DAGs $N, N'$ are isomorphic, in symbols $N \cong N'$, when
they are isomorphic as directed graphs and the isomorphism preserves the leaves' labels.

A phylogenetic network on a set $S$ of taxa is a rooted $S$-DAG.

For every node $u$ of a phylogenetic network $N = (V, E)$, let $C(u)$ be the set of all its
descendants in $N$ and $N(u)$ the subgraph of $N$ supported on $C(u)$: it is still a phylogenetic
network, with root $u$ and leaves labeled in the subset $C_L(u) \subseteq S$ of labels of the leaves
that are descendants of $u$. We shall call $N(u)$ the rooted subnetwork of $N$ generated by $u$,
and the set of leaves $C_L(u)$ the cluster of $u$.
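
As a minimal sketch of these notions, assuming the same hypothetical encoding of a DAG as a dictionary from nodes to lists of children (with each leaf identified with its label), descendants, clusters and rooted subnetworks can be computed as follows. The function names are ours.

```python
def descendants(children, u):
    """Set C(u) of descendants of u; u itself is included (path of length 0)."""
    seen, stack = set(), [u]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return seen


def cluster(children, u):
    """Cluster of u: the labels of the leaves that are descendants of u."""
    return frozenset(v for v in descendants(children, u) if not children[v])


def rooted_subnetwork(children, u):
    """Subgraph N(u) supported on C(u), in the same dictionary encoding."""
    return {v: list(children[v]) for v in descendants(children, u)}


# Hypothetical network: root r, hybrid node h with parents a and b.
toy = {"r": ["a", "b"], "a": ["1", "h"], "b": ["h", "3"], "h": ["2"],
       "1": [], "2": [], "3": []}
```

Because all arcs leaving a descendant of $u$ end in another descendant of $u$, restricting the dictionary to $C(u)$ loses no arc of $N(u)$.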

A clade of a phylogenetic network $N$ is a rooted subnetwork of $N$ all whose nodes are
tree nodes in $N$ (and, in particular, it is a rooted tree).

1.2 Moret-Nakhleh-Warnow et al.'s reduction process

Let $N = (V, E)$ be a phylogenetic network on a set $S$ of taxa. A subset $U$ of internal nodes
of $V$ is said to be convergent when it has more than one element, and all nodes in it have
exactly the same cluster.
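
Maximal convergent sets can be found by grouping the internal nodes by cluster. The following is a sketch under the assumption that a network is given as a dictionary from nodes to lists of children; the example network `net` is hypothetical.

```python
from collections import defaultdict


def clusters(children):
    """Map every node to its cluster (frozenset of descendant leaf labels)."""
    memo = {}

    def cl(v):
        if v not in memo:
            kids = children[v]
            memo[v] = (frozenset([v]) if not kids
                       else frozenset().union(*(cl(c) for c in kids)))
        return memo[v]

    for v in children:
        cl(v)
    return memo


def maximal_convergent_sets(children):
    """Groups of more than one internal node sharing exactly the same cluster."""
    by_cluster = defaultdict(set)
    for v, c in clusters(children).items():
        if children[v]:  # only internal nodes can belong to a convergent set
            by_cluster[c].add(v)
    return [nodes for nodes in by_cluster.values() if len(nodes) > 1]


# Hypothetical network in which r, a and b all have cluster {1, 2}.
net = {"r": ["a", "b"], "a": ["c", "d"], "b": ["c", "d"],
       "c": ["1"], "d": ["2"], "1": [], "2": []}
convergent = maximal_convergent_sets(net)
```

Here the only maximal convergent set is $\{r, a, b\}$; the nodes $c$ and $d$ are internal but share their cluster with no other internal node.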

The removal of convergent sets is the basis of the reduction procedure introduced in [3]: 

(0) Replace every clade by a new 'symbolic leaf' labeled with the names of all leaves in it.

(1) For every maximal convergent set U, remove all internal descendants of its nodes 
(including the nodes of U). 

(2) For every remaining node $x$ that was a parent of a removed node $v$, add a new arc
from $x$ to every (symbolic) leaf in $C_L(v)$.

(The resulting DAG contains no convergent set of nodes, because this step does not 
change the clusters of the surviving nodes.) 

(3) Append to every symbolic leaf representing a clade the corresponding clade, with an 
arc from the symbolic leaf to the root of the clade, and remove the label of the symbolic 
leaf. 

(4) Replace every node with only one parent and one child by an arc from its parent to 
its only child. 

(Since the DAG resulting from (2) contains no set of convergent nodes, it contains 
no node with only one child. Therefore the only possible nodes with only one parent 
and one child after step (3) are those that were symbolic leaves with only one parent. 
These are the only nodes that have to be removed in this step.) 

The output of this procedure applied to a phylogenetic network $N$ on $S$ is a (not
necessarily rooted) $S$-DAG, called the reduced version of $N$ and denoted by $R(N)$. A network
$N$ is reduced when $R(N) = N$. It should be noticed that the only possible convergent
sets in $R(N)$ consist of a hybrid node and its only child (more specifically, the hybrid
node corresponding to a symbolic leaf with more than one parent, and the root of the
corresponding clade) [2].

Two networks $N_1$ and $N_2$ are said to be indistinguishable when they have isomorphic
reduced versions, that is, when $R(N_1) \cong R(N_2)$. Moret, Nakhleh, Warnow, et al. argue in
[3, p. 19] that for reconstructible phylogenetic networks this notion of indistinguishability
(isomorphism after simplification) is more suitable than the existence of an isomorphism
between the original networks.



2 Nakhleh's 'metric' 



Nakhleh defines in [4] an equivalence on the set of nodes of a pair of $S$-DAGs inductively
as follows.

Definition 1. Let $N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$ be $S$-DAGs (not necessarily different).
Two nodes $u \in V_1$ and $v \in V_2$ are equivalent, in symbols $u \equiv v$, when:

- $u$ and $v$ are both leaves labeled with the same taxon, or

- for some $k \geq 1$, node $u$ has exactly $k$ children $u_1, \ldots, u_k$, node $v$ has exactly $k$ children
$v_1, \ldots, v_k$, and $u_i \equiv v_i$ for every $i = 1, \ldots, k$.

The following characterization of node equivalence will be useful. 

Definition 2. Let $N = (V, E)$ be a DAG. The nested labeling $\ell(v)$ of the nodes $v$ of $N$
is defined by induction on $h(v)$ as follows:

- If $h(v) = 0$, that is, if $v$ is a leaf, then $\ell(v) = \{v\}$, the one-element set consisting of
its label.

- If $h(v) = m > 0$, then all its children $v_1, \ldots, v_k$ have height smaller than $m$, and hence
they have been already labeled: then, $\ell(v)$ is the multiset of their nested labels,

$$\ell(v) = \{\ell(v_1), \ldots, \ell(v_k)\}.$$

Notice that the nested label of a node is, in general, a nested multiset (a multiset
of multisets of multisets of...), hence its name. Moreover, the height of a node $u$ is the
highest level of nesting of a leaf in $\ell(u)$ minus 1.
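
The nested labeling can be computed bottom-up. The sketch below is an illustration under our own assumptions: a DAG is a dictionary from nodes to lists of children, leaf labels contain neither braces nor commas, and a nested multiset is encoded as a canonical string obtained by sorting the encodings of its elements, so that two multisets are equal exactly when their strings are equal.

```python
def nested_labels(children):
    """Return {node: canonical string encoding of its nested label}."""
    memo = {}

    def label(v):
        if v not in memo:
            kids = children[v]
            if not kids:
                # A leaf is labeled by the one-element set {v}.
                memo[v] = "{" + str(v) + "}"
            else:
                # Sorting makes the encoding independent of the child order,
                # which is what turns the list into a multiset.
                memo[v] = "{" + ",".join(sorted(label(c) for c in kids)) + "}"
        return memo[v]

    for v in children:
        label(v)
    return memo


# Hypothetical network: root r, hybrid node h with parents a and b.
toy = {"r": ["a", "b"], "a": ["1", "h"], "b": ["h", "3"], "h": ["2"],
       "1": [], "2": [], "3": []}
labels = nested_labels(toy)
```

For instance, the hybrid node `h` gets the label `{{2}}`, whose leaf is nested two levels deep, in agreement with $h(h) = 1$.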

Proposition 1. Let $N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$ be DAGs (not necessarily different)
labeled in a set $S$. For every $u \in V_1$ and $v \in V_2$, $u \equiv v$ if, and only if, $\ell(u) = \ell(v)$.

Proof. We prove the equivalence by induction on the height of one of the nodes, say $u$.

If $h(u) = 0$, then it is a leaf, and $\ell(u)$ is the one-element set consisting of its label.
Thus, in this case, $u \equiv v$ if, and only if, $v$ is the leaf of $N_2$ with the same label as $u$, and
$\ell(u) = \ell(v)$ if, and only if, $v$ is the leaf of $N_2$ with the same label as $u$, too.

Consider now the case when $h(u) = m > 0$ and assume that the thesis holds for all
nodes $u' \in V_1$ of height smaller than $m$. Let $u_1, \ldots, u_k$ be the children of $u$. Then:

- $u \equiv v$ if and only if $v$ has exactly $k$ children and they can be ordered $v_1, \ldots, v_k$ in such
a way that $u_i \equiv v_i$ for every $i = 1, \ldots, k$.

- $\ell(u) = \ell(v)$ if and only if $v$ has exactly $k$ children and the multiset of their nested labels
is equal to the multiset of nested labels of $u_1, \ldots, u_k$, which means that $v$'s children
can be ordered $v_1, \ldots, v_k$ in such a way that $\ell(u_i) = \ell(v_i)$ for every $i = 1, \ldots, k$.

Since, by induction, the children of $u$ satisfy the thesis, it is clear that $u \equiv v$ is equivalent
to $\ell(u) = \ell(v)$.

We shall say that a nested label $\ell(v)$ is contained in a nested label $\ell(u)$, in symbols
$\ell(v) \preccurlyeq \ell(u)$, when $\ell(v)$ is the nested label of a descendant of $u$. Notice that if $\ell(v)$ is
contained in $\ell(u)$, then $v$ is equivalent to some descendant of $u$, but $v$ itself need not be a
descendant of $u$: several instances of this fact can be detected in the networks depicted in
Fig. 1. Notice moreover that $\ell(v) \in \ell(u)$ if, and only if, $\ell(v)$ is the nested label of a child
of $u$.

Nakhleh defines in [4] the following dissimilarity measure. 



Definition 3. For every $S$-DAG $N$, let $\Upsilon(N)$ be the multiset of equivalence classes of its
nodes (where each equivalence class appears with multiplicity the number of nodes in it).

Definition 4. For every pair of phylogenetic networks $N_1$ and $N_2$ on the same set $S$ of
taxa, let

$$m(N_1, N_2) = \frac{1}{2}\,|\Upsilon(N_1) \,\triangle\, \Upsilon(N_2)|,$$

where $\triangle$ denotes the symmetric difference of multisets: if a class belongs to $\Upsilon(N_1)$ with
multiplicity $a$ and to $\Upsilon(N_2)$ with multiplicity $b$, then it contributes $|a - b|$ to $|\Upsilon(N_1) \triangle \Upsilon(N_2)|$.

Notice that $\Upsilon(N)$ can be also understood as the multiset of nested labels of the nodes
of $N$, each nested label appearing with multiplicity the number of nodes labeled with it.
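
Using the description of the multiset of equivalence classes in terms of nested labels, the measure $m$ can be sketched as follows. The encoding is our own assumption (a network is a dictionary from nodes to lists of children, and nested labels are canonical strings); the two trees `t1` and `t2` are hypothetical examples.

```python
from collections import Counter


def nested_label_multiset(children):
    """Multiset (Counter) of nested labels, encoded as canonical strings."""
    memo = {}

    def label(v):
        if v not in memo:
            kids = children[v]
            inner = str(v) if not kids else ",".join(sorted(label(c) for c in kids))
            memo[v] = "{" + inner + "}"
        return memo[v]

    return Counter(label(v) for v in children)


def nakhleh_m(n1, n2):
    """Half the size of the symmetric difference of the two multisets."""
    u1, u2 = nested_label_multiset(n1), nested_label_multiset(n2)
    # A class with multiplicities a and b contributes |a - b|.
    return sum(abs(u1[k] - u2[k]) for k in u1.keys() | u2.keys()) / 2


# Two hypothetical trees on the taxa 1, 2, 3: ((1,2),3) and (1,(2,3)).
t1 = {"r": ["x", "3"], "x": ["1", "2"], "1": [], "2": [], "3": []}
t2 = {"r": ["1", "y"], "y": ["2", "3"], "1": [], "2": [], "3": []}
distance = nakhleh_m(t1, t2)
```

The two trees share the three leaf labels and differ in their two internal nodes each, so the symmetric difference has four elements and the value of $m$ is 2.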

Lemma 1. Let $N_1$ and $N_2$ be two $S$-DAGs such that neither of them contains any pair of
equivalent nodes. Then, $m(N_1, N_2) = 0$ if, and only if, $N_1 \cong N_2$.

Proof. Let $N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$. If neither $N_1$ nor $N_2$ contains any
pair of equivalent nodes, then $\Upsilon(N_1)$ and $\Upsilon(N_2)$ are sets, and the quotient mappings
$V_i \to \Upsilon(N_i)$ are bijections, for $i = 1, 2$.

Now, assume that $|\Upsilon(N_1) \triangle \Upsilon(N_2)| = 0$. Then $\Upsilon(N_1) = \Upsilon(N_2)$ and hence there exists
a well-defined bijection $\alpha : V_1 \to V_2$ that sends each node in $N_1$ to the only node in $N_2$
equivalent to it. In particular it sends each leaf of $N_1$ to the leaf of $N_2$ with the same
label. To see that $\alpha$ is an isomorphism of graphs, let $(u, v) \in E_1$ be any arc in $N_1$. Since
$u \equiv \alpha(u)$, the node $\alpha(u)$ must have a child equivalent to $v$, and since $N_2$ does not contain
any pair of equivalent nodes, this child is $\alpha(v)$, which implies that $(\alpha(u), \alpha(v)) \in E_2$. This
shows that $\alpha$ preserves arcs, and a similar argument applied to $\alpha^{-1} : V_2 \to V_1$ shows that
it also reflects them. This proves that $\alpha : N_1 \to N_2$ is an isomorphism of $S$-DAGs.

The converse implication is obvious. 

A first consequence of this lemma is the following result, which is essentially Theorem 
2 in Nakhleh's paper [4]. 

Proposition 2. Let $R(N_1)$ and $R(N_2)$ be the reduced versions of two phylogenetic networks
on the same set $S$ of taxa. Then, $m(R(N_1), R(N_2)) = 0$ if, and only if, $R(N_1) \cong R(N_2)$.

Proof. The reduced version of a phylogenetic network does not contain any pair of equiv- 
alent nodes [4, Obs. 2]. 

Corollary 1. Let $N_1$ and $N_2$ be two reduced phylogenetic networks on the same set $S$ of
taxa. Then, $m(N_1, N_2) = 0$ if, and only if, $N_1 \cong N_2$.

Another type of phylogenetic networks not containing any pair of equivalent nodes are
the tree-child phylogenetic networks: phylogenetic networks where every internal node has
a child that is a tree node. Tree-child phylogenetic networks were introduced in [1], where
a metric and an alignment method for them were proposed, and they have recently been
proposed by S. J. Willson as the class where meaningful phylogenetic networks should be
searched [7].

Proposition 3. A tree-child phylogenetic network does not contain any pair of equivalent 
nodes. 



Proof. Let $u$ and $v$ be two nodes of a tree-child phylogenetic network $N$. If $u \equiv v$, then
$C_L(u) = C_L(v)$ and $h(u) = h(v)$. Let now $s$ be a leaf for which there exists a path $u \rightsquigarrow s$
with all intermediate nodes and $s$ itself tree nodes (which exists by [2, Lem. 2]). Then
$s \in C_L(u) = C_L(v)$, which implies that there is also a path $v \rightsquigarrow s$. By [2, Lem. 1], this
implies that $u$ and $v$ are connected by a path. If they have moreover the same height, they
must be the same node.

Corollary 2. Let $N_1$ and $N_2$ be two tree-child phylogenetic networks on the same set $S$
of taxa. Then, $m(N_1, N_2) = 0$ if, and only if, $N_1 \cong N_2$.

Remark 1. It is false in general that if two arbitrary phylogenetic networks $N_1$ and $N_2$ on
the same set $S$ of taxa are such that $m(N_1, N_2) = 0$, then $N_1 \cong N_2$. For instance, it is
easy to check that the networks depicted in Fig. 1 have the same multisets $\Upsilon$, but they
are not isomorphic.




Fig. 1. These phylogenetic networks have the same multisets of equivalence classes of nodes, but 
they are not isomorphic 
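
The phenomenon described in Remark 1 can be checked mechanically on a small pair of networks. The pair `n1`, `n2` below is a hypothetical example of our own (it is not claimed to be the pair of Fig. 1), and the encoding of a network as a dictionary from nodes to lists of children is likewise an assumption of this sketch. Non-isomorphism is detected through a simple invariant: any isomorphism maps the root to the root and preserves the numbers of parents of its children.

```python
from collections import Counter


def nested_label_multiset(children):
    """Multiset (Counter) of nested labels, encoded as canonical strings."""
    memo = {}

    def label(v):
        if v not in memo:
            kids = children[v]
            inner = str(v) if not kids else ",".join(sorted(label(c) for c in kids))
            memo[v] = "{" + inner + "}"
        return memo[v]

    return Counter(label(v) for v in children)


def indegrees(children):
    """Number of parents of each node that has at least one parent."""
    deg = Counter()
    for kids in children.values():
        deg.update(kids)
    return deg


leaves = {"1": [], "2": [], "3": []}
# In n1 the root points to x1, which is also a child of p1.
n1 = {"r": ["p1", "p2", "x1"], "p1": ["x1", "3"], "p2": ["x2", "3"],
      "x1": ["1", "2"], "x2": ["1", "2"], **leaves}
# In n2 the root points to x2, while p1 and p2 share the child x1.
n2 = {"r": ["p1", "p2", "x2"], "p1": ["x1", "3"], "p2": ["x1", "3"],
      "x1": ["1", "2"], "x2": ["1", "2"], **leaves}

same_multisets = nested_label_multiset(n1) == nested_label_multiset(n2)
root_children_indegrees_1 = sorted(indegrees(n1)[c] for c in n1["r"])
root_children_indegrees_2 = sorted(indegrees(n2)[c] for c in n2["r"])
```

Both networks have the same multiset of nested labels, so $m$ vanishes on this pair, but the children of the root have different numbers of parents in each network, so the two networks cannot be isomorphic.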

Now, it turns out that this metric m separates networks that are distinguishable up to 
reduction. We would like to recall here that this was the (unaccomplished [2]) goal of the 
error metric defined in [3]. 

Theorem 1. Let $N_1$ and $N_2$ be two phylogenetic networks on the same set $S$ of taxa. If
$m(N_1, N_2) = 0$, then $N_1$ and $N_2$ are indistinguishable.

Proof. In this proof, we shall take $\Upsilon(N)$ as the multiset of nested labels of a network $N$. Let
$N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$ be two phylogenetic networks such that $\Upsilon(N_1) = \Upsilon(N_2)$.
We shall prove that the reduction process of both networks modifies exactly in the same
way their multisets of nested labels, and thus the reduced versions $R(N_1)$ and $R(N_2)$ also
have the same multisets of nested labels. Then, by Proposition 2, the latter are isomorphic.



To begin with, notice that two nodes are convergent when the sets of $S$-labels appearing
in their nested labels are the same (without taking into account nesting levels or
multiplicities). In particular, $N_1$ and $N_2$ have the same sets of nested labels of convergent nodes.

Step (0) in the reduction process consists of replacing every clade by a symbolic leaf.
This corresponds to removing the nested labels of the nodes belonging to clades (except their
roots) and replacing, in all remaining nested labels, each nested label of a root of a clade
by the label of the corresponding symbolic leaf. We must prove now that we can decide
from the multisets of nested labels alone which are the nested labels of nodes of clades and
of roots of clades.

Since the clades of a phylogenetic network are subtrees, a node belonging to a clade is
only equivalent to itself (if $v$ is a node of a clade and $v \equiv u$, then $C_L(u) = C_L(v)$, but in
this case, since $v$ is the least common ancestor of $C_L(v)$ in the clade it belongs to, $v$ must be
a descendant of $u$, and since $u$ and $v$ have the same height, because they are equivalent,
they must be the same node). In particular, a node of a clade does not share its nested
label with any other node.

Then, the nested labels of nodes $v \in V_i$ belonging to some clade of $N_i$ ($i = 1, 2$)
are characterized by the following two properties: $\ell(v)$ and each one of the nested labels
contained in it appear with multiplicity 1 in $\Upsilon(N_1) = \Upsilon(N_2)$ (and in particular $v$ and
its descendants are characterized by their nested labels); and $\ell(v)$ and each one of the
nested labels contained in it belong at most to one nested label (this means that $v$ and its
descendants are tree nodes, and in particular that the rooted subnetwork generated by $v$
is a tree consisting only of tree nodes from $N_i$). Therefore the roots of clades of $N_i$ are
the nodes $v$ with nested label $\ell(v)$ maximal with these properties, and the nodes of the
clade rooted at $v$ are those nodes with nested labels contained in $\ell(v)$. This shows that
the nested labels of roots of clades and the nested labels of nodes belonging to clades in
$N_1$ are the same as in $N_2$.

So, we remove the same nested labels in Ni and N2 and we replace the same nested 
labels by symbolic leaves. As a consequence, the networks resulting after this step have 
the same nested labels. 

In step (1), all nodes that are convergent with some other node are removed, and
all nodes other than symbolic leaves that are descendants of some removed node are also
removed. So, in this step we remove the nested labels of convergent nodes, and the nested
labels other than singletons that are contained in some nested label of a convergent node
(notice that if $\ell(v)$ is not a singleton and it is contained in $\ell(u)$ and $u$ is convergent,
then either $v$ is a descendant of $u$, and then it has to be removed, or it is equivalent to a
descendant of $u$, and then it forms a convergent set with this descendant and it has to be
removed, too). This shows that the nested labels of the nodes removed in both networks
are the same, and hence that the nested labels of the nodes that remain in both networks
are also the same.

In step (2), the paths from the remaining nodes to the labels are restored. This means
replacing, in each remaining nested label $\ell(x)$, each maximal nested label $\ell(v) \preccurlyeq \ell(x)$ of
a removed node $v$ by the singletons $\{s_1\}, \{s_2\}, \ldots, \{s_p\}$ of the symbolic leaves appearing
in $\ell(v)$. Again, this operation only depends on the nested labels, and therefore after this
step the resulting DAGs have the same multisets of nested labels.

In step (3), clades are restored. This is simply done by replacing in the nested labels
each symbolic leaf $s$ by the nested label of the root of the clade it replaced, between
brackets (because we append it to the node corresponding to the symbolic leaf). Since the
same clades were removed in both networks and replaced by the same symbolic leaves,
after this step the resulting DAGs still have the same multisets of nested labels.

Finally, in step (4), the nodes with only one parent and only one child are removed.
This corresponds to removing nested labels of the form $\{\{\ldots\}\}$ that are children of only one
parent (that is, that belong to only one nested label), and hence the same nested labels
are removed in both DAGs.

So, at the end of this procedure, the resulting DAGs $R(N_1)$ and $R(N_2)$ have the
same multisets of nested labels. By Proposition 2, this implies that $R(N_1)$ and $R(N_2)$ are
isomorphic.

The converse implication is, of course, false: since the reduction process may remove
parts with different topologies that yield differences in the multisets of equivalence classes,
two phylogenetic networks with isomorphic reduced versions may have different multisets
of equivalence classes.

The value $m(N_1, N_2)$ can be computed in time polynomial in the sizes of the networks
$N_1, N_2$ by performing a simultaneous bottom-up traversal of the two networks [5, 6].

3 A metric for arbitrary phylogenetic networks

If instead of the equivalence classes of the nodes (or, equivalently, their nested labels) 
we consider the whole rooted subnetworks generated by the nodes, we can define a true 
distance on the whole class of all phylogenetic networks. 

Remark 2. It is clear that if $u$ and $v$ are two nodes of two phylogenetic networks $N_1$ and
$N_2$, respectively (it can happen that $N_1 = N_2$), such that the rooted subnetworks $N_1(u)$
and $N_2(v)$ generated by them are isomorphic, then $u \equiv v$ (because the equivalence can be
computed within these rooted subnetworks). But the converse implication is false: node
equivalence in phylogenetic networks does not imply isomorphism of the rooted subnetworks.
Consider for instance the non-isomorphic phylogenetic networks depicted in Fig. 1:
it is easy to check that their roots are equivalent.

Definition 5. For every $S$-DAG $N$, let $\Sigma(N)$ be the multiset of isomorphism classes of
the rooted subnetworks generated by its nodes.

Definition 6. For every pair of phylogenetic networks $N_1$ and $N_2$ on the same set $S$ of
taxa, let

$$\sigma(N_1, N_2) = \frac{1}{2}\,|\Sigma(N_1) \,\triangle\, \Sigma(N_2)|,$$

where $\triangle$ denotes the symmetric difference of multisets.
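
For very small networks, the value of this distance can be computed by brute force. The sketch below is only an illustration under our own assumptions: networks are dictionaries from nodes to lists of children, arcs are not repeated, and isomorphism is tested by trying every bijection between internal nodes, which is exponential and not the bottom-up procedure mentioned elsewhere in this note. The pair `n1`, `n2` is a hypothetical example (not claimed to be that of Fig. 1) whose nested label multisets coincide.

```python
from itertools import permutations


def rooted_subnetwork(children, u):
    """Rooted subnetwork generated by u, as {node: frozenset of children}."""
    keep, stack = set(), [u]
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(children[v])
    return {v: frozenset(children[v]) for v in keep}


def isomorphic(g1, g2):
    """Brute-force label-preserving isomorphism test for small S-DAGs."""
    leaves1 = {v for v in g1 if not g1[v]}
    leaves2 = {v for v in g2 if not g2[v]}
    inner1 = [v for v in g1 if g1[v]]
    inner2 = [v for v in g2 if g2[v]]
    if leaves1 != leaves2 or len(inner1) != len(inner2):
        return False
    for perm in permutations(inner2):
        f = dict(zip(inner1, perm))
        f.update((s, s) for s in leaves1)  # leaves are fixed by their labels
        if all(frozenset(f[c] for c in g1[v]) == g2[f[v]] for v in g1):
            return True
    return False


def sigma(n1, n2):
    reps, counts = [], []  # class representatives and [mult. in n1, mult. in n2]
    for idx, net in enumerate((n1, n2)):
        for u in net:
            sub = rooted_subnetwork(net, u)
            for i, rep in enumerate(reps):
                if isomorphic(sub, rep):
                    counts[i][idx] += 1
                    break
            else:  # no known class matches: open a new one
                reps.append(sub)
                counts.append([0, 0])
                counts[-1][idx] = 1
    return sum(abs(a - b) for a, b in counts) / 2


leaves = {"1": [], "2": [], "3": []}
n1 = {"r": ["p1", "p2", "x1"], "p1": ["x1", "3"], "p2": ["x2", "3"],
      "x1": ["1", "2"], "x2": ["1", "2"], **leaves}
n2 = {"r": ["p1", "p2", "x2"], "p1": ["x1", "3"], "p2": ["x1", "3"],
      "x1": ["1", "2"], "x2": ["1", "2"], **leaves}
distance = sigma(n1, n2)
```

On this pair all proper rooted subnetworks match up, and only the two whole networks fall in different isomorphism classes, so the distance is 1 even though the nested labels of the two roots coincide.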

Theorem 2. Let $N_1$ and $N_2$ be two phylogenetic networks on the same set $S$ of taxa.
Then, $\sigma(N_1, N_2) = 0$ if, and only if, $N_1 \cong N_2$.

Proof. Assume that $\sigma(N_1, N_2) = 0$, that is, $\Sigma(N_1) = \Sigma(N_2)$. Since each $N_i$ is its rooted
subnetwork generated by its root, we conclude that $N_1$ contains a rooted subnetwork
isomorphic to $N_2$ and $N_2$ contains a rooted subnetwork isomorphic to $N_1$. The only possibility
is then that $N_1$ and $N_2$ are isomorphic (otherwise, $N_1$ would contain a rooted subnetwork
isomorphic to it and strictly contained in it, something that in finite graphs is impossible).
The converse implication is obvious.



Corollary 3. The mapping $\sigma$ is a metric on the class of all phylogenetic networks on the
set $S$ of taxa, that is, it satisfies the following properties: for all phylogenetic networks
$N_1, N_2, N_3$ on the set $S$,

(a) Non-negativity: $\sigma(N_1, N_2) \geq 0$

(b) Separation: $\sigma(N_1, N_2) = 0$ if, and only if, $N_1 \cong N_2$

(c) Symmetry: $\sigma(N_1, N_2) = \sigma(N_2, N_1)$

(d) Triangle inequality: $\sigma(N_1, N_3) \leq \sigma(N_1, N_2) + \sigma(N_2, N_3)$

Proof. Properties (a) and (c) are straightforward, property (b) is a consequence of the
last theorem, and property (d) is a consequence of the triangle inequality of the symmetric
difference of multisets.
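
The triangle inequality for the symmetric difference of multisets can be spelled out class by class. In the following derivation, the symbols $a_c$, $b_c$ and $d_c$ are our own notation for the multiplicities of an isomorphism class $c$ in the multisets of rooted subnetworks of $N_1$, $N_2$ and $N_3$, respectively.

```latex
\begin{align*}
2\,\sigma(N_1, N_3)
  &= \sum_{c} |a_c - d_c| \\
  &\leq \sum_{c} \bigl( |a_c - b_c| + |b_c - d_c| \bigr) \\
  &= 2\,\sigma(N_1, N_2) + 2\,\sigma(N_2, N_3).
\end{align*}
```

The inequality in the second line is the triangle inequality for the absolute value, applied to each class separately; dividing by 2 yields property (d).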

The computation of $\sigma$ has at least the same complexity as the $S$-DAG isomorphism
problem (because the latter can be decided using $\sigma$), and isomorphism of general DAGs
can be reduced to $S$-DAG isomorphism. Therefore, the problem of deciding whether $\sigma$
can be computed in polynomial time for arbitrary phylogenetic networks remains open.
But if we bound the in- and out-degree of the nodes, the $S$-DAG isomorphism problem is
in P, and therefore $\sigma$ can be computed in polynomial time by performing a simultaneous
bottom-up traversal of the two networks.

4 Conclusion 

In this paper we have complemented Luay Nakhleh's latest proposal of a metric $m$ for
phylogenetic networks by (a) showing that $m$ separates distinguishable networks, and (b)
proposing a modification of its definition that provides a true metric $\sigma$ on the class of all
phylogenetic networks. When both distances $m$ and $\sigma$ are applied to phylogenetic trees,
they both yield half the symmetric differences of the sets of (isomorphism classes of)
subtrees.

The measure $m$ can be computed in time polynomial in the size of the networks,
but since $\sigma$ can be used to decide the isomorphism problem for $S$-DAGs, we are led
to conjecture that it cannot be computed in polynomial time (as any other dissimilarity
measure for phylogenetic networks satisfying the separation property). Anyway, $\sigma$ can
also be computed in polynomial time on subspaces of phylogenetic networks with bounded
in- and out-degree.

Given a set $S$ of $n > 2$ labels, there exists no upper bound for the values of $\sigma(N_1, N_2)$
and $m(N_1, N_2)$, as there exist arbitrarily large phylogenetic networks with $n$ leaves and
no internal node of any one of them equivalent to an internal node of the other one.